This guide provides an overview of what a Differential Expression Analysis report created by the BigMind team contains. For the integrated and automated analysis, we utilized functions from our custom R package, OmicsKit.
Sample metadata is crucial for connecting bioinformatics results to biological significance. The first step in our analysis is to clean the data and prepare it for downstream library requirements.
| ID | Test | Condition | Group | Sex | Age | RIN |
|---|---|---|---|---|---|---|
| 1 | Undetected | Normal | A | F | 65 | 6.0 |
| 2 | Undetected | Normal | A | F | 42 | 6.0 |
| 3 | Undetected | Normal | A | M | 63 | 5.5 |
| 4 | Undetected | Normal | A | M | 49 | 5.8 |
| 5 | Undetected | Disease | B | M | 51 | 5.2 |
| 6 | Undetected | Disease | B | M | 53 | 6.1 |
| 7 | Detected | Normal | C | F | 31 | 5.5 |
| 8 | Detected | Normal | C | F | 54 | 5.5 |
| 9 | Detected | Normal | C | M | 43 | 5.3 |
| 10 | Detected | Normal | C | M | 67 | 5.7 |
| 11 | Detected | Normal | C | M | 59 | 6.1 |
| 12 | Detected | Normal | C | M | 46 | 5.2 |
| 13 | Detected | Normal | C | M | 40 | 5.5 |
| 14 | Detected | Disease | D | F | 70 | 5.6 |
| 15 | Detected | Disease | D | F | 36 | 5.9 |
| 16 | Detected | Disease | D | F | 49 | 6.0 |
| 17 | Detected | Disease | D | F | 64 | 6.6 |
| 18 | Detected | Disease | D | F | 66 | 5.6 |
| 19 | Detected | Disease | D | M | 57 | 5.4 |
| 20 | Detected | Disease | D | M | 51 | 6.4 |
| 21 | Detected | Disease | D | M | 48 | 6.7 |
Identifying samples with abnormal counts can highlight outliers that may need to be excluded from analysis to avoid skewed results.
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## summarizing abundance
## summarizing counts
## summarizing length
Consider variables such as age, sex, batch effects, or other biological and technical factors that might impact gene expression. ## Heatmap samples vs samples
pca12 <- nice_PCA(transf.data, PCs = c(1, 2),
ntop = nrow(assay(transf.data)),
variables = c("gp", "sex"),
legend_names = c("Group", "Sex"),
size = 9, alpha = 1, colors = nice_colors,
shapes = 21:22, legend_title = 10,
legend_elements = 8, legend_pos = NULL,
labels = c(var = "num", size = 3))
p.tsne <- nice_tSNE(object = transf.data, seed = seed, perplexity = plex,
max_iterations = iter, returnData = FALSE,
variables = c("group", "sex"), colors = nice_colors,
shapes = 21:22, size = 9, alpha = 1,
legend_names = c("Group", "Sex"),
legend_title = 10, legend_elements = 8, legend_pos = NULL,
labels = c(var = "num", size = 3))
## sigma summary: Min. : 0.592256294214936 |1st Qu. : 0.80258671064056 |Median : 0.883021151140274 |Mean : 0.95360027035352 |3rd Qu. : 1.06449277487991 |Max. : 1.58782782421574 |
## Epoch: Iteration #1000 error is: 0.800904491733047
## Epoch: Iteration #2000 error is: 0.509953545086777
## Epoch: Iteration #3000 error is: 0.506591721778965
## Epoch: Iteration #4000 error is: 0.506573987936383
## Epoch: Iteration #5000 error is: 0.506572323927411
## Epoch: Iteration #6000 error is: 0.50657227929598
## Epoch: Iteration #7000 error is: 0.506572278941316
## Epoch: Iteration #8000 error is: 0.50657227892593
## Epoch: Iteration #9000 error is: 0.50657227890614
## Epoch: Iteration #10000 error is: 0.506572278880658
DESeq2 is a widely used tool for analyzing differential gene expression in RNA-seq data. We performed our analysis adjusting the analysis and improving filtering. ## Create Comparisons
## using pre-existing normalization factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
| baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | ensembl | symbol | description | biotype | |
|---|---|---|---|---|---|---|---|---|---|---|
| ENSG00000000003 | 4.8597885 | -1.3065575 | 0.7629200 | -1.7125747 | 0.0867908 | NA | ENSG00000000003 | TSPAN6 | tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] | protein_coding |
| ENSG00000000005 | 0.0306208 | 0.5719487 | 4.3949775 | 0.1301369 | 0.8964581 | NA | ENSG00000000005 | TNMD | tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] | protein_coding |
| ENSG00000000419 | 183.1677539 | -0.0499737 | 0.1492696 | -0.3347882 | 0.7377848 | 0.9888945 | ENSG00000000419 | DPM1 | dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005] | protein_coding |
| ENSG00000000457 | 239.4758468 | 0.0334553 | 0.1156794 | 0.2892072 | 0.7724228 | 0.9921239 | ENSG00000000457 | SCYL3 | SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285] | protein_coding |
| ENSG00000000460 | 32.1310287 | 0.1098629 | 0.2650042 | 0.4145706 | 0.6784562 | 0.9813005 | ENSG00000000460 | C1orf112 | chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565] | protein_coding |
| ENSG00000000938 | 4095.0035816 | -0.0322014 | 0.1230765 | -0.2616373 | 0.7936011 | 0.9927180 | ENSG00000000938 | FGR | FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697] | protein_coding |
When performing differential expression analysis of a study with 3 phenotypes, including the baseline, there are 10 mutually exclusive cases where genes can fall into.
split_cases <- function (df.BvsA = NULL, df.CvsA = NULL,
df.BvsC = NULL, unique_id = "ensembl",
significance_var = "padj", significance_cutoff = 0.25,
change_var = "log2FoldChange", change_cutoff = 0)
This function identifies genes with measurable expression levels across samples.
detect_genes <- function(norm.counts, df.BvsA, df.CvsA = NULL,
df.DvsA = NULL, cutoffs = c(50, 50, 0),
samples.baseline, samples.condition1,
samples.condition2 = NULL, samples.condition3 = NULL)
GO enrichment analysis with clusterProfiler helps interpret differential expression results by identifying biological pathways that are overrepresented in differentially expressed genes, providing insights into the underlying biological mechanisms.
Linear regression models utilize RNA-seq data to identify relationships between gene expression levels and clinical outcomes, enabling the development of prognosis models that predict disease progression or patient survival based on gene expression profiles.